fix: extract Express.js anonymous route handler callbacks#49
Conversation
The JS/TS parser pipeline did not extract anonymous arrow function callbacks used as Express route handlers, so most application code in a typical Express app was invisible to analysis. This change adds detection in typescript_analyzer.js: - Recognises Express-style call sites (`<obj>.<verb>(<path>, ...callbacks)` with verb in `get|post|put|patch|delete|options|head|all|use`). - Filters the receiver to plausibly-Express identifiers (`app`, `router`, `routes`, `server` and chained `.route(...)`/`.Router()`) so generic `.get(...)` calls on caches/clients aren't misread as routes. - Extracts anonymous arrow / function expressions in callback positions as units, marking the last as `route_handler` (with `is_entry_point: true`) and earlier callbacks as `route_middleware`. - Adds metadata: `http_method`, `http_path`, `callback_index`, `named_middleware`. - Records explicit call-graph edges from each anonymous callback to named middleware identifiers in the same call (e.g. `authenticateToken`); dependency_resolver.js merges these with the body-text regex edges so reachability/upstream-deps work. - unit_generator.js surfaces the metadata on the unit (`route`, `is_entry_point`, `http_method`, `http_path`, `callback_index`). Addresses #21. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Express anonymous route handler extraction added synthetic entries to `data["functions"]` but not to `data["callGraph"]`, breaking the existing invariant `len(callGraph) == len(functions)` exercised by `test_js_parser.TestTypeScriptAnalyzer::test_builds_call_graph`. Emit a callGraph entry for each synthesised route_handler / route_middleware unit, capturing inline call expressions from the callback body. Named middleware identifiers continue to flow through `explicitCalls` and are merged downstream by `dependency_resolver.js`. Add a regression test asserting the callGraph/functions invariant for the synthetic units.
Adds four regression tests for `_extractExpressRouteCallbacks`:
- TypeScript-annotated callbacks `(req: Request, res: Response) => {...}`
parse correctly and produce the expected synthetic handler unit.
- `app.get('/' + prefix, handler)` (dynamic path) is skipped without
throwing — confirms the StringLiteral check is a hard gate.
- `app.use((req, res, next) => {...})` with no path produces a single
unit with http_path=null and http_method='USE'.
- `app.get(path, anonMw, namedHandler)` (anon middleware before named
handler): the anon callback gets a route_middleware unit with
named_middleware=['namedHandler'], and the named handler is left to
the regular extractor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Manual verificationSample Express file: const router = express.Router();
router.post('/orders', authenticateToken, async (req, res) => {
const { productId } = req.body;
// ...
});
|
Local test resultsBuilt a tiny inline Express fixture (1 file, 13 lines) and ran the JS pipeline (typescript_analyzer + unit_generator) from this branch. Fixture ( const express = require('express');
const router = express.Router();
function authenticateToken(req, res, next) { return next(); }
router.post('/orders', authenticateToken, async (req, res) => {
const { productId } = req.body;
res.json({ ok: true, productId });
});
module.exports = router;Commands run: Outcome (per checklist item):
One small thing worth a look: the manual-verification checklist mentions top-level |
ar7casper
left a comment
There was a problem hiding this comment.
Solid PR overall — the receiver filter, named-middleware-as-call-graph-edge wiring, and the callGraph invariant test are all real improvements over a naive "extract anything that looks like a callback" approach. One blocker, two medium nits below.
Blocker: anonymous middleware bodies are silently dropped from analysis.
The new route_middleware unitType isn't in ENTRY_POINT_TYPES in utilities/agentic_enhancer/entry_point_detector.py:26-31 (currently {route_handler, view_function, websocket_handler, cli_handler}), and the synth units have isEntryPoint: false. The chain:
- Analyzer creates
route_middlewareunits ✅ - Entry point detector doesn't flag them (unitType not in
ENTRY_POINT_TYPES,isEntryPointfalse) - Reachability filter has no entry point to reach them from (Express invokes middleware via the framework, not via a code-resolvable call edge)
- Filtered out before the LLM sees them.
Net effect: a route like
app.post('/x',
async (req, res, next) => { /* anonymous middleware doing dangerous thing */ next(); },
async (req, res) => { /* the handler */ }
);gets a unit for the handler ✅ but the middleware body is silently dropped — even though it sees req directly and could be doing absolutely anything dangerous. That's the inverse of what #21 asks for.
One-line fix: add 'route_middleware' to ENTRY_POINT_TYPES. Or set isEntryPoint: true for all callbacks (matches the simpler-but-correct approach in #48). I'd suggest the former since you've already done the structural work to distinguish handler vs middleware — making both entry points preserves that distinction without losing analysis coverage. Worth adding a regression test that asserts the middleware unit survives the reachability filter end-to-end.
Two inline nits below (medium severity, not blockers).
Also worth flagging for follow-up (non-blocking):
route_middlewareis a newunit_typethe LLM enhancer prompt has no specific guidance for. The schema accepts arbitrary strings so no crash, but the LLM doesn't have a mental model for "this is middleware vs the handler." Probably fine, just noting.- You flagged the spurious self-edge (
UserController.get → UserController.get) on #39 a couple days ago —extractCallsFromFunction(arg, relativePath)is called per-handler here too, same code path, same risk. Worth checking if the same pattern shows up in this PR's output.
| * `.get(...)` calls on caches / clients / query-builders aren't misread | ||
| * as routes. | ||
| */ | ||
| _isPlausibleExpressReceiver(receiver) { |
There was a problem hiding this comment.
The receiver filter has known false negatives for common naming conventions: codebases that use single-word identifiers like web, api, endpoints, controller for the Express app/router get no extraction at all, since those names don't end in app/router/routes/server. From a quick scan of popular Express apps in the wild, web and api are not unusual choices.
Suggestions (non-blocking):
- Either expand the suffix list (add
endpoints,controller,web,api— or treat any short identifier as a candidate and rely on the verb-set + string-path filter to reject false positives), or - Document the assumption in the JSDoc and the
route_handlerextraction docstring so reviewers / users understand why their app might silently produce zero route units.
There was a problem hiding this comment.
Fixed in f869c97. Expanded the stem list to include web, api, endpoints, and controller, and consolidated both the Identifier and PropertyAccessExpression branches to share a single static EXPRESS_RECEIVER_STEMS field so they can't drift out of sync. Updated the JSDoc to document the remaining coverage boundary (identifiers outside the stem list will produce zero units).
| } else if (k === "PropertyAccessExpression") { | ||
| // e.g. middleware.auth — keep the trailing name | ||
| const name = arg.getName ? arg.getName() : arg.getText(); | ||
| namedMiddleware.push(name); |
There was a problem hiding this comment.
For PropertyAccessExpression args (app.post('/x', middleware.auth, ...)), this stores just the trailing name ("auth"), which _resolveCall will then look up in the global function map. If there's any other function called auth anywhere in the codebase (e.g. someClass.auth), _resolveCall may resolve to the wrong one — silent false-positive call graph edge.
Suggestions (non-blocking):
- Store the full
arg.getText()(e.g."middleware.auth") and letdependency_resolver's existing_resolveMethodCallhandle the property-access pattern. - Or, if you want to keep the simple-name semantics, document the limitation and add a test exercising the ambiguous-name case so the behavior is intentional.
There was a problem hiding this comment.
Documented with a comment explaining the known limitation — trailing-name-only storage (e.g. "auth" from "middleware.auth") can resolve to the wrong target if another function shares the same simple name. Chose documentation over a dual-file refactor (also touching dependency_resolver.js) as the simpler, lower-risk path for a non-blocking nit.
…filter Blocker: add 'route_middleware' to ENTRY_POINT_TYPES so anonymous Express middleware bodies are not silently dropped by the reachability filter. Add regression tests for EntryPointDetector that assert both route_handler and route_middleware units survive as entry points. Non-blocking nit 1: expand _isPlausibleExpressReceiver to accept common single-word app/router identifiers (web, api, endpoints, controller) that were previously missed, causing zero route units to be emitted for those codebases. Consolidate the stem list into a static class field so both the Identifier and PropertyAccessExpression branches stay in sync. Non-blocking nit 2: document the known limitation that PropertyAccessExpression middleware args are stored as trailing-name only (e.g. "auth" from "middleware.auth"), which can produce false-positive call-graph edges when another function shares the same simple name. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Actioned all review comments in f869c97 — summary of what was fixed: Blocker: Nit 1: receiver filter false negatives Nit 2: PropertyAccessExpression trailing-name ambiguity Self-edge concern (checked, no change) |
|
@ar7casper responded |
ar7casper
left a comment
There was a problem hiding this comment.
Round-2 review — thanks for actioning all 3 round-1 findings cleanly in f869c97.
Verified each fix:
| Round-1 | Status |
|---|---|
🟠 Blocker — route_middleware not in ENTRY_POINT_TYPES |
✅ Added at entry_point_detector.py:28 + new regression test in test_entry_point_detector.py |
| 🟡 Medium — Receiver filter false negatives | ✅ Added web/api/endpoints/controller to EXPRESS_RECEIVER_STEMS static field (consolidated for both Identifier and PropertyAccess branches) |
| 🟡 Medium — Named-middleware trailing-name | ⚖️ Documented as known limitation at typescript_analyzer.js:406-410. Reasonable call — fix scope was non-trivial. Comment names the failure mode for future readers. |
CI all green on Ubuntu/macOS/Windows. Solid PR.
One Medium worth raising for awareness (not blocking — it's pre-existing) and two housekeeping Lows:
🟡 Medium — JS pipeline doesn't normalize unitType → unit_type (pre-existing, not introduced by this PR)
The new regression test (test_route_middleware_is_entry_point) passes in isolation because it constructs func_data with unit_type (snake_case) directly. But in the JS production pipeline the chain is:
typescript_analyzer.js:446emitsunitType: "route_middleware"(camelCase)parsers/javascript/test_pipeline.py:497readsfunctions = analyzer.get("functions", {})— no normalizationentry_point_detector.py:176readsfunc_data.get('unit_type', '')(snake_case) — misses the camelCase keyunit_typeCheck 1 silently returns no match for ALL JS units- Only Check 3 (input patterns:
req.body,req.params, etc. in code body) actually fires
For typical Express middleware that touch req.body/req.headers, Check 3 catches them — practical effect is fine. But for minimal middleware like async (req, res, next) => next(); (rare but possible), Check 3 doesn't match and the unit gets dropped despite this PR's ENTRY_POINT_TYPES fix.
This is the documented bug per parser-issues #23-26 + Room-for-Improvement #10 — predates this PR. Not blocking for merge, but worth noting because the regression test gives somewhat-false confidence: the structural piece is correct, but the production effect for JS depends on the broader normalization work.
Suggestions (any one, all non-blocking):
- Most thorough: fix the upstream parser-issues #23-26 in a separate PR — normalize JS analyzer output to use snake_case before passing to
EntryPointDetector. - PR-scoped: add an end-to-end test that runs the full JS pipeline (
typescript_analyzer.js→unit_generator.js→EntryPointDetectorreading the actual analyzer output) and asserts aroute_middlewareunit with noreq.Xbody references survives the reachability filter. Would surface the gap concretely. - Minimum: acknowledge in PR body that the structural fix is complete but practical effect for JS middleware depends on the broader unit_type normalization work.
🟢 Low — PR body stale
Body still says Filters the receiver to plausibly-Express identifiers ("app", "router", "routes", "server", plus chained ".route(...)"/".Router()" calls) — but the actual stems list is now app|router|routes|server|web|api|endpoints|controller. Worth updating to list all 8 stems.
🟢 Low — New receiver patterns lack test coverage
The new stems (web, api, endpoints, controller) work as designed, but no test exercises them. Existing tests use router and app. A future refactor that breaks the regex for one of the new patterns wouldn't be caught.
Suggestion: parameterize the existing happy-path test, or add 1-2 new fixtures with web.get(...) and api.post(...) to lock in the expanded behavior.
Verdict: ✅ ready to merge. Approving via UI.
Covers the seven PRs in this release: - #35: parse --level default → reachable (CLI consistency with scan + Python CLI) - #36: auto-detect dep changes via ~/.openant/venv/.deps-hash - #37: lazy JS parser npm bootstrap on first use - #39: TypeScript/NestJS DI-aware call resolution (constructor + field + functional inject()) - #40: --language auto opt-in for openant init + non-git path support + shared config/languages.json - #49: Express anonymous route handler extraction (route_handler / route_middleware) - #50: --llm-reachability opt-in stage + cross-parser call_graph.json contract Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
The JS/TS parser pipeline did not extract anonymous arrow function callbacks used as Express route handlers, so most application code in a typical Express app was invisible to analysis.
This PR adds detection in
typescript_analyzer.js:<obj>.<verb>(<path>, ...callbacks)with verb inget|post|put|patch|delete|options|head|all|use).app,router,routes,server, plus chained.route(...)/.Router()calls) so generic.get(...)calls on caches/clients/query-builders aren't misread as routes.route_handler(withis_entry_point: true) and earlier callbacks asroute_middleware.http_method,http_path,callback_index,named_middleware.authenticateToken);dependency_resolver.jsmerges these "explicit" edges with the body-text regex edges so reachability/upstream-deps see the relationship.unit_generator.jssurfaces the new fields on each unit (route,is_entry_point,http_method,http_path,callback_index). Existing route_handlers detected by the per-function classifier still flow through unchanged.Addresses #21 (does not close — verify against the user's reported repo first).
Test plan
tests/parsers/javascript/test_express_route_handlers.pycovers:is_entry_point=true,http_method/http_path, call-graph edge toauthenticateTokenrouter.use('/api', anonMw1, anonMw2, anonHandler): oneroute_handler+ tworoute_middlewaremyCache.get('foo', () => {})andqueryBuilder.post(...): nothing extractedrouter.get('/x', namedHandler): no anonymous unit synthesised; named function still extracted by the regular pathtests/test_js_parser.py) still pass.pytest tests/-> 96 passed, 12 skipped, 3 xfailed (preexisting Windows-only xfails).dataset.json,is_entry_point: true,routepopulated.